R Workflow
This article outlines an analysis project workflow that I’ve found efficient for making reproducible research reports, using R with R Markdown and now Quarto. I start by covering the creation of annotated analysis files and running descriptive statistics on them, with the goals of understanding the data and their quality and completeness. Functions in the Hmisc package are used to annotate data frames and data tables with labels and units of measurement and to produce tabular and graphical statistical summaries. Several examples of processing and manipulating data using the data.table package are given. Finally, examples of caching results and doing parallel processing are presented.
Assignment Operator
You assign a value to an R object using the assignment operator <- or the equal sign. <- is read as “gets”.
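A minimal illustration (the values here are arbitrary; the object on the right-hand side must already exist):

```r
y <- 10       # y gets 10
x <- y        # x gets the value of y; read as "x gets y"
x = y + 1     # the equal sign also performs assignment at the top level
x             # 11
```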
Object Types
Everything in R is an object. Some primitive types of objects in R are below.
| Type | Meaning |
|---|---|
| integer | whole numbers |
| logical | values of TRUE or FALSE |
| double | floating point non-whole numbers |
| character | character strings |
| function | code defining a function |
In the table below, objects of different shapes are described. rows and cols refer to vectors of integers or logicals or, if the elements of the object are named, character strings.
| Type | Example | Values Retrieved By |
|---|---|---|
| scalar | x <- 3 | x |
| vector | y <- c(1, 2, 5) | y[2] (2), y[2:3] (2, 5), y[-1] (2, 5), y[c(TRUE,FALSE,TRUE)] (1, 5) |
| named vector | y <- c(a=1, b=2, d=5) | y[2] (2), y['b'] (2), y[c('a','b')] (1, 2) |
| matrix | y <- cbind(1:3, 4:6) | y[rows,cols], y[rows,] (all cols), y[,cols] (all rows) |
| list | x <- list(a='cat', b=c(1,3,7)) | x$a ('cat'), x[[1]] ('cat'), x[['a']] ('cat') |
Named vectors provide an extremely quick table lookup and recoding capability.
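As a small sketch of that capability, a named vector can serve directly as a recoding lookup table (the codes and categories here are hypothetical):

```r
# Named-vector lookup table mapping animal codes to categories
lookup <- c(cat='domestic', dog='domestic', fox='wild')
x <- c('dog', 'fox', 'dog', 'cat')
unname(lookup[x])   # "domestic" "wild" "domestic" "domestic"
```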
list objects are arbitrary trees and can have elements nested to any level. You can have lists of lists or lists of data frames/tables.
Vectors can take on many different types when a class is added to them. Two of the most common are Dates and factors. Character strings are handled very efficiently in R, so there is not always a need to store categorical variables as factors. But there is one good reason to use factors: to order levels, i.e., distinct variable values, so that tabular and graphical output will list values in a more logical order than alphabetic. A factor variable has a levels attribute added to it to accomplish this. An example is x <- factor(x, 1:3, c('cat', 'dog', 'fox')), where the second argument 1:3 is the vector of possible numeric values x currently takes on (in order) and the three character strings are the corresponding levels. Internally factors are coded as integers, but they print as character strings.
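A brief sketch of why level ordering matters (the severity ratings here are made up): with plain character strings, table output is alphabetic, while a factor with specified levels controls the display order.

```r
x <- c('medium', 'low', 'high', 'low')
table(x)    # alphabetic order: high, low, medium
f <- factor(x, levels=c('low', 'medium', 'high'))
table(f)    # logical order: low, medium, high
```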
Rectangular data objects, i.e., when the number of rows is the same for every column (variable), can be represented by matrices, data.frames, and data.tables. In a matrix, every value is of the same type. A data.frame or a data.table is an R list that can have mixtures of numeric, character, factor, dates, and other object types. A data.table is also a data.frame but the converse isn’t true. data.tables are handled by the R data.table package and don’t have row names but can be indexed, are much faster to process, and have a host of methods implemented for aggregation and other operations. data.frames are handled by base R.
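A minimal illustration of a data frame being an R list of equal-length columns of possibly different types (in R ≥ 4.0 character columns stay character rather than becoming factors):

```r
d <- data.frame(age=c(30, 41), sex=c('m', 'f'))
is.list(d)          # TRUE: a data.frame is an R list
sapply(d, class)    # columns may differ in type: numeric, character
nrow(d)             # 2 rows; every column has the same length
```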
Subscripting
Examples of subscripting are given above. Subscripting via placement of [] after an object name is used for subsetting, and occasionally for using some elements more than once:
x <- c('cat', 'dog', 'fox')
x[2:3]
[1] "dog" "fox"
x[c(1, 1, 3, 3, 2)]
[1] "cat" "cat" "fox" "fox" "dog"
Subscripting a variable or a data frame/table by a vector of TRUE/FALSE values is a very powerful feature of R. This is used to obtain elements satisfying one or more conditions:
x <- c(1, 2, 3, 2, 1, 4, 7)
y <- c(1, 8, 2, 3, 8, 9, 2)
x[y > 7]
[1] 2 1 4
The last line of code can be read as “values of x such that y > 7”.
Branching and If/Then
Decisions Based on One Scalar Value
Common approaches to this problem are if and switch.
type <- 'semiparametric'
f <- switch(type,
       parametric     = ols(y ~ x),
       semiparametric = orm(y ~ x),
       nonparametric  = rcorr(x, y, type='spearman'),
       { z <- y / x
         c(median=median(z), gmean=exp(mean(log(z)))) } )
# The last 2 lines are executed for any type other than the 3 listed
f <- if(type == 'parametric') ols(y ~ x) else
     if(type == 'semiparametric') orm(y ~ x) else
     if(type == 'nonparametric') rcorr(x, y, type='spearman') else {
       z <- y / x
       c(median=median(z), gmean=exp(mean(log(z))))
     }

What is inside if( ) must be a single scalar element that evaluates to TRUE or FALSE.
Series of Separate Decisions Over a Vector of Values
The ifelse function is most often used for this. Here’s an example.
x <- c('cat', 'dog', 'giraffe', 'elephant')
type <- ifelse(x %in% c('cat', 'dog'), 'domestic', 'wild')
type
[1] "domestic" "domestic" "wild"     "wild"
Functions
Even new R users can benefit from writing functions to reduce repetitive coding. A function has arguments and these can have default values for when the argument is not specified by the user when the function is called. Here are some examples. One line functions do not need to have their bodies enclosed in {}.
cuberoot <- function(x) x ^ (1/3)
cuberoot(8)
[1] 2
g <- function(x, power=2) {
u <- abs(x - 0.5)
u / (1. + u ^ power)
}
g(3, power=2)
[1] 0.3448276
g(3)
[1] 0.3448276
# Function to make mean() drop missing values without our telling it
mn <- function(x) mean(x, na.rm=TRUE)
# Function to be used throughout the report to round fractional values
# by a default amount (here round to 0.001)
rnd <- function(x) round(x, 3)
# edit the 3 to change rounding anywhere in the report
# The following simple function saves coding when you need to recode multiple
# variables from 0/1 to no/yes.
yn <- function(x) factor(x, 0:1, c('no', 'yes'))

File Directory Structure
I have a directory for each major project, and put everything in that directory (including data files) except for graphics files for figures, which are placed in their own subdirectory underneath the project folder. The directory name for the project includes key identifying information, and files within that directory do not contain project information in their names, nor do they contain dates, unless I want to freeze an old version of an analysis script.
In Quarto I specify that html files are to be self-contained, so there are no separate graphics files. For multi-analyst projects, or ones in which you want to capture the entire code history, having the project on GitHub is worthwhile.
Analysis File Creation
I typically create a compact analysis file in a separate R script called create.r and have it produce compressed R binary data.frame or data.table .rds files using saveRDS(name, 'name.rds', compress='xz'). Then I have an analysis script named for example a.qmd (for Quarto reports) or a.Rmd (for RMarkdown reports) that starts with d <- readRDS('name.rds').
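A minimal sketch of this two-script pattern (the file and variable names are illustrative stand-ins for a real import):

```r
# --- create.r: import and annotate once, then save a compressed binary file ---
d <- data.frame(age=c(30, 41, 57))     # stand-in for the imported, annotated data
saveRDS(d, 'name.rds', compress='xz')

# --- top of a.qmd / a.Rmd: every analysis run starts from the binary file ---
d <- readRDS('name.rds')
```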
When variables need to be recoded, have labels added or changed, or have units of measurement added, I specify those using the Hmisc package upData function.
Labels and units appear in describe and contents function outputs below, and in axis labels for graphics. To facilitate some operations requiring variable names to be quoted, define a function .q to quote them automatically. .q is like the Hmisc function Cs but also allows elements to be named. It will be in Hmisc 4.7-1.
.q <- function(...) {
s <- sys.call()[-1]
w <- as.character(s)
n <- names(s)
if(length(n)) names(w) <- n
w
}
.q(a, b, c, 'this and that')
[1] "a"             "b"             "c"             "this and that"
.q(dog=a, giraffe=b, cat=c)
    dog giraffe     cat
    "a"     "b"     "c"
Here is an upData example:
# Function to recode from atypical coding for yes/no in raw data
yn <- function(x) factor(x, 0:1, c('yes', 'no'))
d <-
upData(d,
rename = .q(gender=sex, any.event=anyEvent),
posSE = yn(posSE),
newMI = yn(newMI),
newPTCA = yn(newPTCA),
newCABG = yn(newCABG),
death = yn(death),
hxofHT = yn(hxofHT),
hxofDM = yn(hxofDM),
hxofCig = factor(hxofCig, c(0, 0.5, 1),
c('heavy', 'moderate', 'non-smoker')),
hxofMI = yn(hxofMI),
hxofPTCA = yn(hxofPTCA),
hxofCABG = yn(hxofCABG),
chestpain= yn(chestpain),
anyEvent = yn(anyEvent),
drop=.q(event.no, phat, mics, deltaEF,
newpkmphr, gdpkmphr, gdmaxmphr, gddpeakdp, gdmaxdp,
hardness),
labels=c(
bhr = 'Basal heart rate',
basebp = 'Basal blood pressure',
basedp = 'Basal Double Product bhr*basebp',
age = 'Age',
pkhr = 'Peak heart rate',
sbp = 'Systolic blood pressure',
dp = 'Double product pkhr*sbp',
dose = 'Dose of dobutamine given',
maxhr = 'Maximum heart rate',
pctMphr = 'Percent maximum predicted heart rate achieved',
mbp = 'Maximum blood pressure',
dpmaxdo = 'Double product on max dobutamine dose',
dobdose = 'Dobutamine dose at max double product',
baseEF = 'Baseline cardiac ejection fraction',
dobEF = 'Ejection fraction on dobutamine',
chestpain = 'Chest pain',
ecg = 'Baseline electrocardiogram diagnosis',
restwma = 'Resting wall motion abnormality on echocardiogram',
posSE = 'Positive stress echocardiogram',
newMI = 'New myocardial infarction',
newPTCA = 'Recent angioplasty',
newCABG = 'Recent bypass surgery',
hxofHT = 'History of hypertension',
hxofDM = 'History of diabetes',
hxofMI = 'History of myocardial infarction',
hxofCig = 'History of smoking',
hxofPTCA = 'History of angioplasty',
hxofCABG = 'History of coronary artery bypass surgery',
anyEvent = 'Death, newMI, newPTCA, or newCABG'),
units=.q(age=years, bhr=bpm, basebp=mmHg, basedp='bpm*mmHg',
pkhr=mmHg, sbp=mmHg, dp='bpm*mmHg', maxhr=bpm,
mbp=mmHg, dpmaxdo='bpm*mmHg', baseEF='%', dobEF='%',
pctMphr='%', dose=mg, dobdose=mg)
)
saveRDS(d, 'stressEcho.rds', compress='xz')
# Note that we could have automatically recoded all 0:1 variables
# if they were all to be treated identically:
for(x in names(d))
if(all(d[[x]] %in% c(0,1))) d[[x]] <- yn(d[[x]])

Importing Data
The built-in function in R for reading .csv files is read.csv. The Hmisc package has a function csv.get which calls read.csv but offers some enhancements in variable naming, date handling, and reading variable labels from a specified row number. The fread function in the data.table package is blazingly fast for reading large files and offers a number of options.
If reading data exported from REDCap that are placed into the project directory I run the following to get rid of duplicate (factor and non-factor versions of variables REDCap produces) variables and automatically convert dates to Date variables:
require(Hmisc)
getRs('importREDCap.r', put='source') # source() code to define function
mydata <- importREDCap() # by default operates on last downloaded export
saveRDS(mydata, 'mydata.rds', compress='xz')
When file names are not given to importREDCap, the function looks for the most recently created .csv file and the .R file with the same prefix, and uses those. See this for more information.
SAS, Stata, and SPSS binary files are converted to R data.frames using the R haven package. Here’s an example:
require(haven) # after you've installed the package
d <- read_sas('mydata.sas7bdat')
d <- read_xpt('mydata.xpt') # SAS transport files
d <- read_dta('mydata.dta') # Stata
d <- read_sav('mydata.sav') # SPSS
d <- read_por('mydata.por') # Older SPSS files

These import functions carry variable labels into the data frame and convert dates and times appropriately. Character vectors are not converted to factors.
One of the most important principles to follow in programming data analyses is to avoid doing the same thing more than once. Repetitive code wastes time and is harder to maintain. One example of avoiding repetition is in reading a large number of files in R. If the files are stored in one directory and have a consistent file naming scheme (or you want to import every .csv file in the directory), one can avoid naming the individual files. The results may be stored in an R list that has named elements, and there are many processing tasks that can be automated by looping over this list.
In the following example assume that all the data files are .csv files in the current working directory, and they all have names of the form xz*.csv. Let’s read all of them and put each file into a data frame named by the characters in front of .csv. These data frames are stuffed into a list named X. The Hmisc csv.get function is used to read the files, automatically translating dates to Date variables, and because lowernames=TRUE is specified, variable names are translated to lower case. There is an option to fetch variable labels from a certain row of each .csv file but we are not using that.
files <- list.files(pattern='^xz.*\\.csv$') # vector of qualifying file names
# Get base part of file names, without the .csv suffix
dnames <- sub('\\.csv$', '', files)
X <- list()
i <- 0
for(f in files) {
cat('Reading file', f, '\n')
i <- i + 1
d <- csv.get(f, lowernames=TRUE)
# To read SAS, Stata, SPSS binary files use a haven function instead
# To convert to data.table do setDT(d) here
X[[dnames[i]]] <- d
}
saveRDS(X, 'X.rds', compress='xz') # Efficiently store all datasets together

To process one of the datasets one can do things like summary(X[[3]]) or summary(X$baseline), where the third dataset stored was named baseline because it was imported from baseline.csv.
Now there are many possibilities for processing all the datasets at once such as the following.
k <- length(X) # number of datasets
nam <- lapply(X, names) # list with k elements, each is a vector of names
# Get the union of all variable names used in any dataset
sort(unique(unlist(nam))) # all the variables appearing in any dataset
# Get variable names contained in all datasets
common <- names(X[[1]])
for(i in 2 : k) {
common <- intersect(common, names(X[[i]]))
if(! length(common)) break # intersection already empty
}
sort(common)
# Compute number of variables across datasets
nvar <- sapply(X, length) # or ncol
# Print number of observations per dataset
sapply(X, nrow)
# For each variable name count the number of datasets containing it
w <- data.table(dsname=rep(names(X), nvar), vname=unlist(nam))
w[, .N, keyby=vname]
# For each variable create a comma-separated list of datasets
# containing it
w[, .(datasets=paste(sort(dsname), collapse=', ')), keyby=vname]
# For datasets having a subject ID variable named id compute
# the number of unique ids
uid <- function(d) if('id' %in% names(d)) length(unique(d$id)) else NA
sapply(X, uid)
# To repeatedly analyze one of the datasets, extract it to a single data frame
d <- X$baseline
describe(d)

The Hmisc package cleanup.import function improves imported data storage in a number of ways, including converting double precision variables to integer when originally double but not containing fractional values (this halves the storage requirement). Hmisc::upData is the go-to function for annotating data frames/tables, renaming variables, and dropping variables. Hmisc::dataframeReduce removes problematic variables, e.g., those with a high fraction of missing values or that are binary with very small prevalence.
Variable Naming
I prefer short but descriptive variable names. As exemplified above, I use variable labels and units to provide more information. For example I wouldn’t name the age variable age.at.enrollment.yrs but would name it age with a label of Age at Enrollment and with units of years. Short, clear names unclutter code, especially formulas in statistical models. One can always fetch a variable label while writing a program (e.g., typing label(d$age) at the console) to check that you have the right variable (or put the data dictionary in a window for easy reference, as shown below).
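Labels and units are stored as simple attributes on each variable. A base-R sketch of the idea behind Hmisc’s label/units mechanism (Hmisc provides label() and units() accessors; the attribute names below match what Hmisc uses):

```r
d <- data.frame(age=c(30, 41))
attr(d$age, 'label') <- 'Age at Enrollment'   # what Hmisc's label(d$age) retrieves
attr(d$age, 'units') <- 'years'
attr(d$age, 'label')   # "Age at Enrollment"
```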
Hmisc package graphics and table making functions such as summaryM and summary.formula specially typeset units in a smaller font.

Data Dictionary
The Hmisc package contents function will provide a concise data dictionary. Here is an example using the permanent version of the dataset created above, which can be accessed with the Hmisc getHdata function. The top of the contents output has the number of levels for factor variables hyperlinked. Click on the number to go directly to the list of levels for that variable.
The stressEcho dataset coded binary variables as 0/1 instead of N/Y.

require(Hmisc)
getHdata(stressEcho)
d <- stressEcho
html(contents(d), levelType='table')

Data frame: d. 558 observations and 31 variables, maximum # NAs: 0

| Name | Labels | Units | Levels | Storage |
|---|---|---|---|---|
| bhr | Basal heart rate | bpm | | integer |
| basebp | Basal blood pressure | mmHg | | integer |
| basedp | Basal Double Product bhr*basebp | bpm*mmHg | | integer |
| pkhr | Peak heart rate | mmHg | | integer |
| sbp | Systolic blood pressure | mmHg | | integer |
| dp | Double product pkhr*sbp | bpm*mmHg | | integer |
| dose | Dose of dobutamine given | mg | | integer |
| maxhr | Maximum heart rate | bpm | | integer |
| pctMphr | Percent maximum predicted heart rate achieved | % | | integer |
| mbp | Maximum blood pressure | mmHg | | integer |
| dpmaxdo | Double product on max dobutamine dose | bpm*mmHg | | integer |
| dobdose | Dobutamine dose at max double product | mg | | integer |
| age | Age | years | | integer |
| gender | | | 2 | integer |
| baseEF | Baseline cardiac ejection fraction | % | | integer |
| dobEF | Ejection fraction on dobutamine | % | | integer |
| chestpain | Chest pain | | | integer |
| restwma | Resting wall motion abnormality on echocardiogram | | | integer |
| posSE | Positive stress echocardiogram | | | integer |
| newMI | New myocardial infarction | | | integer |
| newPTCA | Recent angioplasty | | | integer |
| newCABG | Recent bypass surgery | | | integer |
| death | | | | integer |
| hxofHT | History of hypertension | | | integer |
| hxofDM | History of diabetes | | | integer |
| hxofCig | History of smoking | | 3 | integer |
| hxofMI | History of myocardial infarction | | | integer |
| hxofPTCA | History of angioplasty | | | integer |
| hxofCABG | History of coronary artery bypass surgery | | | integer |
| any.event | Death, newMI, newPTCA, or newCABG | | | integer |
| ecg | Baseline electrocardiogram diagnosis | | 3 | integer |
| Variable | Levels |
|---|---|
| gender | male, female |
| hxofCig | heavy, moderate, non-smoker |
| ecg | normal, equivocal, MI |
You can write the text output of contents into a text file in your current working directory, and click on that file in the RStudio Files window to create a new tab in the editor panel where you can view the data dictionary at any time. This is especially helpful if you need a reminder of variable definitions that are stored in the variable labels. Here is an example where the formatted data dictionary is saved.
If you have the xless system command installed you can pop up a contents window at any time by typing xless(contents(d)) in the console; the xless function is in Hmisc.

capture.output(contents(d), file='contents.txt')

Or put the html version of the data dictionary into a small browser window to which you can refer at any point in analysis coding.

cat(html(contents(d)), file='contents.html')
browseURL('contents.html', browser='vivaldi -new-window')

RStudio provides a nice way to do this, facilitated by the htmlView function, which may be fetched from Github using the Hmisc getRs function. htmlView takes any number of objects for which an html method exists and renders them in the RStudio Viewer pane. If you are running outside RStudio, your default browser will be launched instead.
Sometimes the RStudio Viewer will drop its arrow button, making it impossible to navigate back and forth to different html outputs.

getRs('htmlView.r', put='source')
htmlView(contents(d))

In some cases it is best to have a browser window open to the full descriptive statistics for a data table/frame (see below; the describe function also shows labels, units, and levels). To be able to have multiple windows open to see information about datasets, it is advantageous to open an external browser window. The htmlViewx function, defined when you read htmlView.r, will by default open a Vivaldi browser window, with the first output put in a new window and all subsequent objects displayed as tabs within that same window. This behavior can be controlled with the tab argument, and you can use a different browser by issuing for example options(vbrowser='firefox'). As an example suppose that two datasets were read from the hbiostat.org/data data repository, and the data dictionary and descriptive statistics for both datasets were to be converted to html and placed in an external browser window for the duration of the R session.
Under Windows the browser name may need to be e.g. firefox.exe. The tab names will not be correct until Hmisc 4.7-1 appears.

getHdata(support)
getHdata(support2)
htmlViewx(contents(support), describe(support),
          contents(support2), describe(support2))

A screenshot of the result is here.
Descriptive Statistics
The Hmisc describe function is my main tool for getting initial descriptive statistics and for quality-controlling the data in a univariate fashion. Here is an example. The Info index is a measure of the information content in a numeric variable relative to the information in a continuous numeric variable with no ties. A very low value of Info will occur when a binary variable is highly imbalanced. Clicking on Glossary on the right will pop up a browser window with a more in-depth glossary of terms used in Hmisc package output. It links to hbiostat.org/R/glossary.html, to which you can link from your own reports that use Hmisc.
Info comes from the approximate formula for the variance of a log odds ratio for a proportional odds model/Wilcoxon test, due to Whitehead.
Gmd in the output stands for Gini’s mean difference—the mean absolute difference over all possible pairs of different observations. It is a very interpretable measure of dispersion that is more robust than the standard deviation.
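From that definition, Gini’s mean difference can be computed by brute force over all pairs. Hmisc’s GiniMd function computes it much more efficiently; this sketch exists only to make the definition concrete:

```r
# Gini's mean difference: mean |x_i - x_j| over all n*(n-1)/2 distinct pairs
gmd <- function(x) {
  dif <- abs(outer(x, x, '-'))   # matrix of all pairwise absolute differences
  mean(dif[lower.tri(dif)])      # average over distinct pairs only
}
gmd(c(1, 2, 4))   # pairs |1-2|, |1-4|, |2-4| -> mean(1, 3, 2) = 2
```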
describe Output
w <- describe(d)
html(w)

31 Variables   558 Observations
bhr: Basal heart rate bpm
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 69 | 0.999 | 75.29 | 16.57 | 54.0 | 58.0 | 64.0 | 74.0 | 84.0 | 95.3 | 102.0 |
basebp: Basal blood pressure mmHg
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 94 | 0.998 | 135.3 | 23.35 | 104.0 | 110.0 | 120.0 | 133.0 | 150.0 | 162.3 | 170.1 |
basedp: Basal Double Product bhr*basebp bpm*mmHg
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 441 | 1 | 10181 | 2813 | 6607 | 7200 | 8400 | 9792 | 11663 | 13610 | 14770 |
pkhr: Peak heart rate mmHg
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 105 | 1 | 120.6 | 25.36 | 81.85 | 90.70 | 106.25 | 122.00 | 135.00 | 147.00 | 155.15 |
sbp: Systolic blood pressure mmHg
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 142 | 1 | 146.9 | 40.72 | 96 | 102 | 120 | 141 | 170 | 200 | 210 |
dp: Double product pkhr*sbp bpm*mmHg
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 508 | 1 | 17634 | 5765 | 10256 | 11341 | 14033 | 17060 | 20644 | 24536 | 26637 |
dose: Dose of dobutamine given mg
| n | missing | distinct | Info | Mean | Gmd |
|---|---|---|---|---|---|
| 558 | 0 | 7 | 0.84 | 33.75 | 8.334 |
Value          10    15    20    25    30    35    40
Frequency       2    28    47    56    64    61   300
Proportion  0.004 0.050 0.084 0.100 0.115 0.109 0.538
maxhr: Maximum heart rate bpm
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 103 | 1 | 119.4 | 24.64 | 82.0 | 91.0 | 104.2 | 120.0 | 133.0 | 146.0 | 154.1 |
pctMphr: Percent maximum predicted heart rate achieved %
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 78 | 0.999 | 78.57 | 16.86 | 53 | 60 | 69 | 78 | 88 | 97 | 104 |
mbp: Maximum blood pressure mmHg
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 132 | 0.999 | 156 | 35.03 | 110.0 | 120.0 | 133.2 | 150.0 | 175.8 | 200.0 | 211.1 |
dpmaxdo: Double product on max dobutamine dose bpm*mmHg
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 484 | 1 | 18550 | 5385 | 11346 | 12865 | 15260 | 18118 | 21239 | 24893 | 27477 |
dobdose: Dobutamine dose at max double product mg
| n | missing | distinct | Info | Mean | Gmd |
|---|---|---|---|---|---|
| 558 | 0 | 8 | 0.941 | 30.24 | 10.55 |
Value           5    10    15    20    25    30    35    40
Frequency       7     7    55    73    71    78    62   205
Proportion  0.013 0.013 0.099 0.131 0.127 0.140 0.111 0.367
age: Age years
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 62 | 0.999 | 67.34 | 13.41 | 46.85 | 51.00 | 60.00 | 69.00 | 75.00 | 82.00 | 85.00 |
gender
| n | missing | distinct |
|---|---|---|
| 558 | 0 | 2 |
Value        male female
Frequency     220    338
Proportion  0.394  0.606
baseEF: Baseline cardiac ejection fraction %
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 54 | 0.994 | 55.6 | 10.71 | 32 | 40 | 52 | 57 | 62 | 65 | 66 |
dobEF: Ejection fraction on dobutamine %
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 60 | 0.992 | 65.24 | 12.38 | 40.0 | 49.7 | 62.0 | 67.0 | 73.0 | 76.0 | 80.0 |
chestpain: Chest pain
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.64 | 172 | 0.3082 | 0.4272 |
restwma: Resting wall motion abnormality on echocardiogram
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.745 | 257 | 0.4606 | 0.4978 |
posSE: Positive stress echocardiogram
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.553 | 136 | 0.2437 | 0.3693 |
newMI: New myocardial infarction
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.143 | 28 | 0.05018 | 0.09549 |
newPTCA: Recent angioplasty
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.138 | 27 | 0.04839 | 0.09226 |
newCABG: Recent bypass surgery
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.167 | 33 | 0.05914 | 0.1115 |
death
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.123 | 24 | 0.04301 | 0.08247 |
hxofHT: History of hypertension
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.625 | 393 | 0.7043 | 0.4173 |
hxofDM: History of diabetes
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.699 | 206 | 0.3692 | 0.4666 |
hxofCig: History of smoking
| n | missing | distinct |
|---|---|---|
| 558 | 0 | 3 |
Value            heavy   moderate non-smoker
Frequency          122        138        298
Proportion       0.219      0.247      0.534
hxofMI: History of myocardial infarction
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.599 | 154 | 0.276 | 0.4004 |
hxofPTCA: History of angioplasty
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.204 | 41 | 0.07348 | 0.1364 |
hxofCABG: History of coronary artery bypass surgery
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.399 | 88 | 0.1577 | 0.2661 |
any.event: Death, newMI, newPTCA, or newCABG
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.402 | 89 | 0.1595 | 0.2686 |
ecg: Baseline electrocardiogram diagnosis
| n | missing | distinct |
|---|---|---|
| 558 | 0 | 3 |
Value        normal equivocal        MI
Frequency       311       176        71
Proportion    0.557     0.315     0.127
# To create a separate browser window:
cat(html(w), file='desc.html')
browseURL('desc.html', browser='firefox -new-window')

Better, whether using RStudio or not:

htmlView(w, contents(d)) # or htmlView(describe(d1), describe(d2), ...)

There is also a plot method for describe output. It produces two graphics objects: one for categorical variables and one for continuous variables. The default is to use ggplot2 to produce static graphics.
p <- plot(w, bvspace=2.5)
p$Categorical
p$Continuous

To make creation of tabs in Quarto easier, here is a function that loops through the elements of a named list and outputs each element into a separate Quarto tab. A wide argument is used to expand the width of the output outside the usual margins. An initblank argument creates a first tab that is empty. This allows one to show nothing until one of the other tabs is clicked. The function is also available from Github and can be defined using the Hmisc getRs function, i.e., getRs('maketabs.r', put='source').

maketabs <- function(x, labels=names(x), wide=FALSE, initblank=FALSE) {
yaml <- paste0('.panel-tabset', if(wide) ' .column-page')
k <- c('', paste0('::: {', yaml, '}'), '')
if(initblank) k <- c(k, '', '## ', '')
.objmaketabs. <<- x
for(i in 1 : length(x)) {
cname <- paste0('c', round(100000 * runif(1)))
k <- c(k, '', paste('##', labels[i]), '',
paste0('```{r ', cname, ',results="asis",echo=FALSE}'),
paste0('.objmaketabs.[[', i, ']]'), '```', '')
}
k <- c(k, ':::', '')
cat(knitr::knit(text=knitr::knit_expand(text=k), quiet=TRUE))
}
maketabs(p)

By specifying the grType option you can instead get plotly graphics that use hover text to show more information, especially when hovering over the leftmost dot or tick mark for a variable.
options(grType='plotly')
p <- plot(w, bvspace=2.5)
maketabs(p, wide=TRUE)See this for other Hmisc functions for descriptive graphics and tables, especially for stratified descriptive statistics for categorical variables. The summaryM function prints a tabular summary of a mix of continuous and categorical variables. Here is an example where stratification is by history of myocardial infarction (MI).
The initblank option to the maketabs function creates the initially opened tab, which has no content, thereby leaving it to the user to click one of the other tabs before output is shown.

require(data.table)
setDT(d) # turn d into a data table
# tables() with no arguments will give a concise summary of all active data tables
w <- copy(d) # copy() so that := below modifies w only, not d by reference
w[, hxofMI := factor(hxofMI, 0 : 1, c('No history of MI', 'History of MI'))]
vars <- setdiff(names(d), 'hxofMI')
form <- as.formula(paste(paste(vars, collapse='+'), '~ hxofMI'))
print(form)

bhr + basebp + basedp + pkhr + sbp + dp + dose + maxhr + pctMphr +
    mbp + dpmaxdo + dobdose + age + gender + baseEF + dobEF + chestpain +
    restwma + posSE + newMI + newPTCA + newCABG + death + hxofHT +
    hxofDM + hxofCig + hxofPTCA + hxofCABG + any.event + ecg ~ hxofMI

s <- summaryM(form, data=w, test=TRUE)
w <- list('Table 1' = html(s, exclude1=TRUE, npct='both', digits=3, middle.bold=TRUE),
'Categorical Variable Plot' = plot(s, which='categorical', vars=1 : 4, height=600, width=1000),
'Continuous Variable Plot' = plot(s, which='continuous', vars=1 : 4))
# initblank=TRUE: Put an empty tab first so that nothing initially displays
maketabs(w, wide=TRUE, initblank=TRUE)

Descriptive Statistics (N=558)

| | No history of MI (N=404) | History of MI (N=154) | Test Statistic |
|---|---|---|---|
| Basal heart rate, bpm | 65 74 85 | 63 72 84 | F(1,556)=1.41, P=0.235¹ |
| Basal blood pressure, mmHg | 120 134 150 | 120 130 150 | F(1,556)=1.39, P=0.238¹ |
| Basal double product bhr*basebp, bpm*mmHg | 8514 9874 11766 | 8026 9548 11297 | F(1,556)=3.32, P=0.069¹ |
| Peak heart rate, mmHg | 107 123 136 | 104 120 132 | F(1,556)=2.35, P=0.126¹ |
| Systolic blood pressure, mmHg | 124 146 174 | 115 134 158 | F(1,556)=12.1, P<0.001¹ |
| Double product pkhr*sbp, bpm*mmHg | 14520 17783 21116 | 13198 15539 18885 | F(1,556)=15, P<0.001¹ |
| Dose of dobutamine given, mg: 10 | 0.00 2⁄404 | 0.00 0⁄154 | χ²(6)=8.77, P=0.187² |
| 15 | 0.05 21⁄404 | 0.05 7⁄154 | |
| 20 | 0.10 40⁄404 | 0.05 7⁄154 | |
| 25 | 0.11 45⁄404 | 0.07 11⁄154 | |
| 30 | 0.11 43⁄404 | 0.14 21⁄154 | |
| 35 | 0.11 45⁄404 | 0.10 16⁄154 | |
| 40 | 0.51 208⁄404 | 0.60 92⁄154 | |
| Maximum heart rate, bpm | 107 122 134 | 102 118 130 | F(1,556)=4.05, P=0.045¹ |
| Percent maximum predicted heart rate achieved, % | 69.0 78.0 89.0 | 70.0 77.0 87.5 | F(1,556)=0.5, P=0.479¹ |
| Maximum blood pressure, mmHg | 138 154 180 | 130 142 162 | F(1,556)=13, P<0.001¹ |
| Double product on max dobutamine dose, bpm*mmHg | 15654 18666 21664 | 14489 16785 19680 | F(1,556)=16.1, P<0.001¹ |
| Dobutamine dose at max double product, mg: 5 | 0.01 4⁄404 | 0.02 3⁄154 | χ²(7)=8.5, P=0.29² |
| 10 | 0.01 6⁄404 | 0.01 1⁄154 | |
| 15 | 0.11 43⁄404 | 0.08 12⁄154 | |
| 20 | 0.14 58⁄404 | 0.10 15⁄154 | |
| 25 | 0.13 54⁄404 | 0.11 17⁄154 | |
| 30 | 0.13 51⁄404 | 0.18 27⁄154 | |
| 35 | 0.10 40⁄404 | 0.14 22⁄154 | |
| 40 | 0.37 148⁄404 | 0.37 57⁄154 | |
| Age, years | 59.0 68.0 75.0 | 63.2 71.0 76.8 | F(1,556)=9.75, P=0.002¹ |
| gender: female | 0.68 273⁄404 | 0.42 65⁄154 | χ²(1)=30, P<0.001² |
| Baseline cardiac ejection fraction, % | 55 59 63 | 40 54 60 | F(1,556)=56.4, P<0.001¹ |
| Ejection fraction on dobutamine, % | 65.0 70.0 74.2 | 50.0 64.5 70.0 | F(1,556)=50.3, P<0.001¹ |
| Chest pain | 0.29 119⁄404 | 0.34 53⁄154 | χ²(1)=1.29, P=0.257² |
| Resting wall motion abnormality on echocardiogram | 0.57 230⁄404 | 0.18 27⁄154 | χ²(1)=69.7, P<0.001² |
| Positive stress echocardiogram | 0.21 86⁄404 | 0.32 50⁄154 | χ²(1)=7.56, P=0.006² |
| New myocardial infarction | 0.03 14⁄404 | 0.09 14⁄154 | χ²(1)=7.4, P=0.007² |
| Recent angioplasty | 0.02 10⁄404 | 0.11 17⁄154 | χ²(1)=17.8, P<0.001² |
| Recent bypass surgery | 0.05 21⁄404 | 0.08 12⁄154 | χ²(1)=1.35, P=0.246² |
| death | 0.04 15⁄404 | 0.06 9⁄154 | χ²(1)=1.23, P=0.267² |
| History of hypertension | 0.69 280⁄404 | 0.73 113⁄154 | χ²(1)=0.89, P=0.346² |
| History of diabetes | 0.36 147⁄404 | 0.38 59⁄154 | χ²(1)=0.18, P=0.674² |
| History of smoking: heavy | 0.21 83⁄404 | 0.25 39⁄154 | χ²(2)=3.16, P=0.206² |
| moderate | 0.24 96⁄404 | 0.27 42⁄154 | |
| non-smoker | 0.56 225⁄404 | 0.47 73⁄154 | |
| History of angioplasty | 0.04 15⁄404 | 0.17 26⁄154 | χ²(1)=28.4, P<0.001² |
| History of coronary artery bypass surgery | 0.08 34⁄404 | 0.35 54⁄154 | χ²(1)=59.6, P<0.001² |
| Death, newMI, newPTCA, or newCABG | 0.12 48⁄404 | 0.27 41⁄154 | χ²(1)=18.1, P<0.001² |
| Baseline electrocardiogram diagnosis: normal | 0.59 240⁄404 | 0.46 71⁄154 | χ²(2)=8, P=0.018² |
| equivocal | 0.29 117⁄404 | 0.38 59⁄154 | |
| MI | 0.12 47⁄404 | 0.16 24⁄154 | |

a b c represent the lower quartile a, the median b, and the upper quartile c for continuous variables. Tests used: ¹Wilcoxon test; ²Pearson test.
Semi-interactive stratified spike histograms are also useful descriptive plots. These plots also contain a superset of the quantiles used in box plots, and the legend is clickable, allowing any of the statistical summaries to be turned off.
d[, histboxp(x=maxhr, group=ecg, bins=200)]
Data Manipulation and Aggregation
The data.table package provides a concise, consistent syntax for managing simple and complex data manipulation tasks, and it is extremely efficient for large datasets. One of the best organized tutorials is this, and a cheatsheet for data transformation is here. A master cheatsheet for data.table is here from which the general syntax below is taken, where DT represents a data table.
DT[ i, j, by ] # + extra arguments
| | |
| | -------> grouped by what?
| -------> what to do?
---> on which rows?
Data tables are also data frames, so any method that handles data frames will also work on data tables. But data tables do not have row names.
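A minimal sketch (using a throwaway table dt, not one of the datasets above) showing that a data table inherits from data.frame, so data.frame methods apply directly:

```r
library(data.table)
dt <- data.table(a = 1:3, b = c('x', 'y', 'z'))  # hypothetical example table
class(dt)          # "data.table" "data.frame"
is.data.frame(dt)  # TRUE, so methods such as nrow() and summary() work on dt
```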
Several data.table examples follow. I like to hold the current dataset in d to save typing. Some basic operations on data tables are:
d[2] # print 2nd row
d[2:4] # print rows 2,3,4
d[y > 2 & z > 3] # rows satisfying conditions
d[, age] # retrieve one variable
d[, .(age, gender)] # make a new table with two variables
i <- 2; d[, ..i] # get column 2
v <- 'age'; d[, ..v] # get variable named by contents of v
d[.N] # last row
d[, .N, keyby=age] # number of rows for each age, sorted
d[1:10, bhr:pkhr] # first 10 rows, variables bhr - pkhr
d[1:10, !(bhr:pkhr)] # all but those variables
d[, 2:4]                 # get columns 2-4
Analyzing Selected Variables and Subsets
d[, html(describe(age))]
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 62 | 0.999 | 67.34 | 13.41 | 46.85 | 51.00 | 60.00 | 69.00 | 75.00 | 82.00 | 85.00 |
d[, html(describe(~ age + gender))]
2 Variables   558 Observations
age: Age years
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 62 | 0.999 | 67.34 | 13.41 | 46.85 | 51.00 | 60.00 | 69.00 | 75.00 | 82.00 | 85.00 |
gender
| n | missing | distinct |
|---|---|---|
| 558 | 0 | 2 |
Value male female Frequency 220 338 Proportion 0.394 0.606
d[gender == 'female', html(describe(age))]   # analyze age for females
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 338 | 0 | 58 | 0.999 | 67.01 | 13.74 | 47.00 | 50.70 | 59.25 | 68.00 | 76.00 | 83.00 | 85.00 |
html(describe(d[, .(age, gender)], 'Age and Gender Stats'))
2 Variables   558 Observations
age: Age years
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 62 | 0.999 | 67.34 | 13.41 | 46.85 | 51.00 | 60.00 | 69.00 | 75.00 | 82.00 | 85.00 |
gender
| n | missing | distinct |
|---|---|---|
| 558 | 0 | 2 |
Value male female Frequency 220 338 Proportion 0.394 0.606
# Separate analysis by female, male. Use keyby instead of by to sort the usual way.
d[, print(describe(age, descript=gender)), by=gender]
male : Age [years]
n missing distinct Info Mean Gmd .05 .10
220 0 51 0.999 67.86 12.91 45.95 51.00
.25 .50 .75 .90 .95
62.00 69.00 75.00 81.00 86.00
lowest : 30 34 38 40 43, highest: 88 89 91 92 93
female : Age [years]
n missing distinct Info Mean Gmd .05 .10
338 0 58 0.999 67.01 13.74 47.00 50.70
.25 .50 .75 .90 .95
59.25 68.00 76.00 83.00 85.00
lowest : 26 28 29 33 34, highest: 87 88 89 90 91
Empty data.table (0 rows and 1 cols): gender
# Compute mean and median age by gender
d[, .(Mean=mean(age), Median=median(age)), by=gender]
   gender     Mean Median
1: male 67.85909 69
2: female 67.00888 68
# To create a new subset
w <- d[gender == 'female' & age < 70, ]
Adding or Changing Variables
With data.table you can create a new data table with added variables, or you can add or redefine variables in a data table in place. The latter has major speed and memory efficiency implications when processing massive data tables. Here d refers to a different data table from the one used above.
# Rename a variable
setnames(d, c('gender', 'height'),
c( 'sex', 'ht'))
# Easier:
setnames(d, .q(gender, height),
.q( sex, ht))
# Or
ren <- .q(gender=sex, height=ht)
setnames(d, names(ren), ren)
# Or
rename <- function(x, n) setnames(x, names(n), n)
rename(d, .q(gender=sex, height=ht))
# For all variables having baseline_ at the start of their names, remove it
n <- names(d)
setnames(d, n, sub('^baseline_', '', n)) # ^ = at start; use $ for at end
# For all variables having baseline_ at the start of their names, remove it
# and add Baseline to start of variable label
n <- names(d)
n <- n[grepl('^baseline_', n)]
ren <- sub('^baseline_', '', n); names(ren) <- n
# Fetch vector of labels for variables whose names start with baseline_
labs <- sapply(d, label)[n] # label() result is "" if no label
labs <- paste('Baseline', labs)
d <- upData(d, rename=ren, labels=labs)
# Change all names to lower case
n <- names(d)
setnames(d, n, tolower(n))
# Abbreviate names of all variables longer than 10 characters
# abbreviate() will ensure that all names are unique
n <- names(d)
setnames(d, n, abbreviate(n, minlength=10))
# For any variables having [...] or (...) in their labels, assume these
# are units of measurement and move them from the label to the units
# attribute
d <- upData(d, moveUnits=TRUE)
# Add two new variables, storing result in a new data table
z <- d[, .(bmi=wt / ht ^ 2, bsa=0.016667 * sqrt(wt * ht))]
# Add one new variable in place
d[, bmi := wt / ht ^ 2]
# Add two new variables in place
d[, `:=`(bmi = wt / ht ^ 2, bsa=0.016667 * sqrt(wt * ht))]
d[, .q(bmi, bsa) := .(wt / ht ^ 2, 0.016667 * sqrt(wt * ht))]
# Compute something requiring a different formula for different types
# of observations
d[, htAdj := ifelse(sex == 'male', ht, ht * 1.07)]
d[, htAdj := ht * ifelse(sex == 'male', 1, 1.07)]
d[, htAdj := (sex == 'male') * ht + (sex == 'female') * ht * 1.07]
d[sex == 'male', htAdj := ht]
d[sex == 'female', htAdj := ht * 1.07]
# Add label & optional units (better to use upData which works on data tables)
adlab <- function(x, lab, un='') {
label(x) <- lab
if(un != '') units(x) <- un
x
}
d[, maxhr := adlab(maxhr, 'Maximum Heart Rate', '/m')]
# Delete a variable (or use upData)
d[, bsa := NULL]
# Delete two variables
d[, `:=`(bsa=NULL, bmi=NULL)]
d[, .q(bsa, bmi) := NULL]
Recoding Variables
# Group levels a1, a2 as a & b1,b2,b3 as b
d[, x := factor(x, .q(a1,a2,b1,b2,b3),
.q( a, a, b, b, b))]
# Regroup an existing factor variable
levels(d$x) <- list(a=.q(a1,a2), b=.q(b1,b2,b3))
# or
d <- upData(d, levels=list(x=list(a=.q(a1,a2), b=.q(b1,b2,b3))))
# Or manipulate character strings
d[, x := substring(x, 1, 1)] # replace x with first character of levels
# or
levels(d$x) <- substring(levels(d$x), 1, 1)
# Recode a numeric variable with values 0, 1, 2, 3, 4 to 0, 1, 1, 1, 2
d[, x := 1 * (x %in% 1:3) + 2 * (x == 4)]
# Recode a series of conditions to a factor variable whose value is taken
# from the last condition that is TRUE using Hmisc::score.binary
# Result is a factor variable unless you add retfactor=FALSE
d[, x := score.binary(x1 == 'symptomatic',
x2 %in% .q(stroke, MI),
death)]
# Same but code with numeric points
d[, x := score.binary(x1 == 'symptomatic',
x2 %in% .q(stroke, MI),
death, # TRUE/FALSE or 1/0 variable
points=c(1,2,10))]
# Recode from one set of character strings to another using named vector
# A named vector can be subscripted with character strings as well as integers
states <- c(AL='Alabama', AK='Alaska', AZ='Arizona', ...)
# Could also do:
# states <- .q(AL=Alabama, AK=Alaska, AZ=Arizona, ..., NM='New Mexico', ...)
# or do a merge for table lookup (see later)
d[, State := states[state]]
# states are unique, state can have duplicates and all are recoded
# Recode from integers 1, 2, ..., to character strings
labs <- .q(elephant, giraffe, dog, cat)
d[, x := labs[x]]
# Recode from character strings to integers
d[, x := match(x, labs)]
As an example of more complex hierarchical recoding, let's define codes in a nested list.
a <- list(plant =
list(vegetable = .q(spinach, lettuce, potato),
fruit = .q(apple, orange, banana, pear)),
animal =
list(domestic = .q(dog, cat, horse),
wild = .q(giraffe, elephant, lion, tiger)) )
a
$plant
$plant$vegetable
[1] "spinach" "lettuce" "potato"
$plant$fruit
[1] "apple" "orange" "banana" "pear"
$animal
$animal$domestic
[1] "dog" "cat" "horse"
$animal$wild
[1] "giraffe" "elephant" "lion" "tiger"
a <- unlist(a)
a
plant.vegetable1 plant.vegetable2 plant.vegetable3     plant.fruit1
"spinach" "lettuce" "potato" "apple"
plant.fruit2 plant.fruit3 plant.fruit4 animal.domestic1
"orange" "banana" "pear" "dog"
animal.domestic2 animal.domestic3 animal.wild1 animal.wild2
"cat" "horse" "giraffe" "elephant"
animal.wild3 animal.wild4
"lion" "tiger"
# Pick names of unlist'ed elements apart to define kingdom and type
n <- sub('[0-9]*$', '', names(a)) # remove sequence numbers from ends of names
# Names are of the form kingdom.type; split at .
s <- strsplit(n, '.', fixed=TRUE)
kingdom <- sapply(s, function(x) x[1])
type <- sapply(s, function(x) x[2])
names(kingdom) <- names(type) <- a
w <- data.table(kingdom, type, item=a, key=c('kingdom', 'item'))
w kingdom type item
1: animal domestic cat
2: animal domestic dog
3: animal wild elephant
4: animal wild giraffe
5: animal domestic horse
6: animal wild lion
7: animal wild tiger
8: plant fruit apple
9: plant fruit banana
10: plant vegetable lettuce
11: plant fruit orange
12: plant fruit pear
13: plant vegetable potato
14: plant vegetable spinach
# Example table lookups
cat(kingdom['dog'], ':', type['dog'], '\n')
animal : domestic
kingdom[.q(dog, cat, spinach)]
     dog      cat  spinach
"animal" "animal"  "plant"
type[.q(dog, cat, giraffe, spinach)]
        dog         cat     giraffe     spinach
 "domestic"  "domestic"      "wild" "vegetable"
# But what if there is a plant named the same as an animal?
# Then look up on two keys
w[.('animal', 'lion'), type]
[1] "wild"
Computing Summary Statistics
Many applications can use the automatically created data.table object .SD, which stands for the data table for the current group being processed. If .SDcols is not specified, all variables are analyzed; specify a vector of variable names (or a selection function) as .SDcols to restrict the analysis. If there are no by variables, .SD stands for the whole data table.
# Compute the number of distinct values for all variables
nd <- function(x) length(unique(x))
d[, sapply(.SD, nd)]
      bhr    basebp    basedp      pkhr       sbp        dp      dose     maxhr
69 94 441 105 142 508 7 103
pctMphr mbp dpmaxdo dobdose age gender baseEF dobEF
78 132 484 8 62 2 54 60
chestpain restwma posSE newMI newPTCA newCABG death hxofHT
2 2 2 2 2 2 2 2
hxofDM hxofCig hxofMI hxofPTCA hxofCABG any.event ecg
2 3 2 2 2 2 3
# Same but only for variables whose names contain hx and either D or M
d[, sapply(.SD, nd), .SDcols=patterns('hx', 'D|M')]
hxofDM hxofMI
2 2
# Compute means on all numeric variables
mn <- function(x) mean(x, na.rm=TRUE)
d[, lapply(.SD, mn), .SDcols=is.numeric]
        bhr   basebp   basedp     pkhr      sbp       dp     dose    maxhr
1: 75.29032 135.3244 10181.31 120.5502 146.9158 17633.84 33.75448 119.3692
pctMphr mbp dpmaxdo dobdose age baseEF dobEF chestpain
1: 78.56989 156 18549.88 30.24194 67.34409 55.60394 65.24194 0.3082437
restwma posSE newMI newPTCA newCABG death hxofHT
1: 0.4605735 0.2437276 0.05017921 0.0483871 0.05913978 0.04301075 0.7043011
hxofDM hxofPTCA hxofCABG any.event
1: 0.3691756 0.0734767 0.1577061 0.1594982
# Compute means on all numeric non-binary variables
nnb <- function(x) is.numeric(x) && length(unique(x)) > 2
d[, lapply(.SD, mn), .SDcols=nnb]
        bhr   basebp   basedp     pkhr      sbp       dp     dose    maxhr
1: 75.29032 135.3244 10181.31 120.5502 146.9158 17633.84 33.75448 119.3692
pctMphr mbp dpmaxdo dobdose age baseEF dobEF
1: 78.56989 156 18549.88 30.24194 67.34409 55.60394 65.24194
# Print frequency tables of all categorical variables with > 2 levels
cmult <- function(x) ! is.numeric(x) && length(unique(x)) > 2
tab <- function(x) {
z <- table(x, useNA='ifany')
paste(paste0(names(z), ': ', z), collapse=', ')
}
d[, lapply(.SD, tab), .SDcols=cmult]
   hxofCig
1: heavy: 122, moderate: 138, non-smoker: 298
ecg
1: normal: 311, equivocal: 176, MI: 71
Tabulate all variables having between 3 and 10 distinct values. As a side effect while data.table runs, the summarization function tab stores all values and frequencies in a growing list Z. After padding each column to the maximum length over all columns (the maximum number of distinct values), kable renders Z as a markdown table.
# Using <<- makes data.table have a side effect of augmenting Z and
# Align in the global environment
tab <- function(x) {
z <- table(x, useNA='ifany')
i <- length(Z)
Z[[i+1]] <<- names(z)
Z[[i+2]] <<- as.vector(z)
Align <<- c(Align, if(is.numeric(x)) 'r' else 'l', 'r')
length(z)
}
discr <- function(x) { i <- length(unique(x)); i > 2 & i < 11 }
Z <- list(); Align <- character(0)
w <- d[, lapply(.SD, tab), .SDcols=discr]
maxl <- max(w)
# Pad shorter vectors with blanks
Z <- lapply(Z, function(x) c(x, rep('', maxl - length(x))))
Z <- do.call('cbind', Z) # combine all into columns of a matrix
colnames(Z) <- rep(names(w), each=2)
colnames(Z)[seq(2, ncol(Z), by=2)] <- 'Freq'
knitr::kable(Z, align=Align)
| dose | Freq | dobdose | Freq | hxofCig | Freq | ecg | Freq |
|---|---|---|---|---|---|---|---|
| 10 | 2 | 5 | 7 | heavy | 122 | normal | 311 |
| 15 | 28 | 10 | 7 | moderate | 138 | equivocal | 176 |
| 20 | 47 | 15 | 55 | non-smoker | 298 | MI | 71 |
| 25 | 56 | 20 | 73 | | | | |
| 30 | 64 | 25 | 71 | | | | |
| 35 | 61 | 30 | 78 | | | | |
| 40 | 300 | 35 | 62 | | | | |
| | | 40 | 205 | | | | |
A better approach is to let the kables function put together a series of separate markdown tables of different sizes. By using the "update Z in the global environment" side effect, data.table can output objects of non-conformable dimensions across variables (such as frequency tabulations).
tab <- function(x) {
z <- table(x, useNA='ifany')
i <- length(Z)
w <- matrix(cbind(names(z), z), ncol=2,
dimnames=list(NULL, c(vnames[i+1], 'Freq')))
Z[[i+1]] <<- knitr::kable(w, align=c(if(is.numeric(x)) 'r' else 'l', 'r'))
length(z)
}
discr <- function(x) { i <- length(unique(x)); i > 2 & i < 11 }
Z <- list()
vnames <- names(d[, .SD, .SDcols=discr])
w <- d[, lapply(.SD, tab), .SDcols=discr]
knitr::kables(Z)
Use a similar side-effect approach to get separate html describe output by gender.
g <- function(x, by) {
Z[[length(Z) + 1]] <<- describe(x, descript=paste('age for', by))
by
}
Z <- list()
by <- d[, g(age, gender), by=gender]
# Make Z look like describe() output for multiple variables
class(Z) <- 'describe'
attr(Z, 'dimensions') <- c(nrow(d), nrow(by))
attr(Z, 'descript') <- 'Age by Gender'
html(Z)
2 Variables   558 Observations
age for male: Age years
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 220 | 0 | 51 | 0.999 | 67.86 | 12.91 | 45.95 | 51.00 | 62.00 | 69.00 | 75.00 | 81.00 | 86.00 |
age for female: Age years
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 338 | 0 | 58 | 0.999 | 67.01 | 13.74 | 47.00 | 50.70 | 59.25 | 68.00 | 76.00 | 83.00 | 85.00 |
# Compute a 1-valued statistic on multiple variables, by cross-classification
# of two variables. Do this on a subset. .SDcols=a:b uses variables in order
# Use keyby instead of by to order output the usual way
d[age < 70, lapply(.SD, mean), keyby=.(gender, newMI), .SDcols=pkhr:dp]
   gender newMI     pkhr      sbp       dp
1: male 0 122.0561 147.7664 17836.24
2: male 1 115.6667 140.6667 16437.67
3: female 0 122.8492 150.5084 18509.03
4: female 1 123.5714 171.5714 21506.71
# Compute multiple statistics on one variable
# Note: .N is a special variable: count of obs for current group
d[, .(Max=max(bhr), Min=min(bhr), Mean=mean(bhr), N=.N), by=.(gender, newMI)]
   gender newMI Max Min     Mean   N
1: male 0 127 42 74.15385 208
2: male 1 89 59 71.25000 12
3: female 0 210 45 75.96894 322
4: female 1 108 65 79.43750 16
# Same result another way
g <- function(x) list(Max=max(x), Min=min(x), Mean=mean(x), N=length(x))
d[, g(bhr), by=.(gender, newMI)]   # if g returned a vector instead, use as.list(g(bhr))
   gender newMI Max Min     Mean   N
1: male 0 127 42 74.15385 208
2: male 1 89 59 71.25000 12
3: female 0 210 45 75.96894 322
4: female 1 108 65 79.43750 16
d[, as.list(quantile(bhr)), by=gender]
   gender 0% 25% 50% 75% 100%
1: male 42 63 72 83 127
2: female 45 65 75 85 210
# Compute mean bhr by quintiles of age using Hmisc::cut2
# Bad statistical practice; use scatterplot + smoother instead
d[, .(Mean=mean(bhr)), keyby=.(fifth=cut2(age, g=5))]
     fifth     Mean
1: [26,60) 78.54545
2: [60,67) 76.96907
3: [67,72) 74.26316
4: [72,78) 75.92793
5: [78,93] 70.03846
# Compute multiple statistics on multiple variables
d[, lapply(.SD, quantile), by=gender, .SDcols=.q(bhr, pkhr, sbp)]
    gender bhr   pkhr    sbp
1: male 42 52.00 80.00
2: male 63 106.75 120.00
3: male 72 123.00 140.00
4: male 83 136.00 166.00
5: male 127 176.00 250.00
6: female 45 61.00 40.00
7: female 65 106.25 122.25
8: female 75 122.00 141.50
9: female 85 134.00 170.00
10: female 210 210.00 309.00
# Similar but put percentile number in front of statistic value
# Do only quartiles
g <- function(x) {
z <- quantile(x, (1:3)/4, na.rm=TRUE)
paste(format(names(z)), format(round(z)))
}
d[, lapply(.SD, g), by=gender, .SDcols=.q(bhr, pkhr, sbp)]
   gender    bhr     pkhr     sbp
1: male 25% 63 25% 107 25% 120
2: male 50% 72 50% 123 50% 140
3: male 75% 83 75% 136 75% 166
4: female 25% 65 25% 106 25% 122
5: female 50% 75 50% 122 50% 142
6: female 75% 85 75% 134 75% 170
# To have more control over labeling and to have one row per sex:
g <- function(x) {
s <- sapply(x, quantile, na.rm=TRUE) # compute quantiles for all variables -> matrix
h <- as.list(s) # vectorizes first
# Cross row names (percentile names) with column (variable) names
# paste(b, a) puts variable name in front of percentile
names(h) <- outer(rownames(s), colnames(s), function(a, b) paste(b, a))
h
}
d[, g(.SD), by=gender, .SDcols=.q(bhr, pkhr, sbp)]
   gender bhr 0% bhr 25% bhr 50% bhr 75% bhr 100% pkhr 0% pkhr 25% pkhr 50%
1: male 42 63 72 83 127 52 106.75 123
2: female 45 65 75 85 210 61 106.25 122
pkhr 75% pkhr 100% sbp 0% sbp 25% sbp 50% sbp 75% sbp 100%
1: 136 176 80 120.00 140.0 166 250
2: 134 210 40 122.25 141.5 170 309
# Restrict to variables bhr - basedp in order columns created in data table
d[, g(.SD), by=gender, .SDcols=bhr : basedp]
   gender bhr 0% bhr 25% bhr 50% bhr 75% bhr 100% basebp 0% basebp 25%
1: male 42 63 72 83 127 85 120.00
2: female 45 65 75 85 210 90 122.25
basebp 50% basebp 75% basebp 100% basedp 0% basedp 25% basedp 50% basedp 75%
1: 130 145 203 5220 7976.25 9483 11183.50
2: 136 150 201 5000 8562.00 10063 11891.25
basedp 100%
1: 16704
2: 27300
# Can put ! in front of a sequence of variables to do the opposite
# To add duplicated means to raw data use e.g.
# d[, Mean := mean(x), by=sex]
Merging Data
Consider a baseline dataset b and a longitudinal dataset L, with subject ID of id.
b <- data.table(id=1:4, age=c(21, 28, 32, 23), key='id')
L <- data.table(id = c(2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5),
day = c(1, 2, 3, 1, 2, 3, 4, 1, 2, 1, 2, 3),
y = 1 : 12, key='id')
b
   id age
1: 1 21
2: 2 28
3: 3 32
4: 4 23
L
    id day  y
1: 2 1 1
2: 2 2 2
3: 2 3 3
4: 3 1 4
5: 3 2 5
6: 3 3 6
7: 3 4 7
8: 4 1 8
9: 4 2 9
10: 5 1 10
11: 5 2 11
12: 5 3 12
# Merge b and L to look up baseline age and associate it with all follow-ups
b[L, on=.(id)]   # Keeps all ids in L (left join of L with b)
   id age day  y
1: 2 28 1 1
2: 2 28 2 2
3: 2 28 3 3
4: 3 32 1 4
5: 3 32 2 5
6: 3 32 3 6
7: 3 32 4 7
8: 4 23 1 8
9: 4 23 2 9
10: 5 NA 1 10
11: 5 NA 2 11
12: 5 NA 3 12
L[b, on=.(id)]   # Keeps all ids in b (left join of b with L)
   id day  y age
1: 1 NA NA 21
2: 2 1 1 28
3: 2 2 2 28
4: 2 3 3 28
5: 3 1 4 32
6: 3 2 5 32
7: 3 3 6 32
8: 3 4 7 32
9: 4 1 8 23
10: 4 2 9 23
L[b, on=.(id), nomatch=NULL]   # Keeps only ids in both b and L (inner join)
   id day  y age
1: 2 1 1 28
2: 2 2 2 28
3: 2 3 3 28
4: 3 1 4 32
5: 3 2 5 32
6: 3 3 6 32
7: 3 4 7 32
8: 4 1 8 23
9: 4 2 9 23
uid <- unique(c(b[, id], L[, id]))
L[b[.(uid), on=.(id)]]   # Keeps ids in either b or L (full outer join)
    id day  y age
1: 1 NA NA 21
2: 2 1 1 28
3: 2 2 2 28
4: 2 3 3 28
5: 3 1 4 32
6: 3 2 5 32
7: 3 3 6 32
8: 3 4 7 32
9: 4 1 8 23
10: 4 2 9 23
11: 5 1 10 NA
12: 5 2 11 NA
13: 5 3 12 NA
merge(b, L, by='id', all=TRUE)   # also full outer join; calls merge.data.table
    id age day  y
1: 1 21 NA NA
2: 2 28 1 1
3: 2 28 2 2
4: 2 28 3 3
5: 3 32 1 4
6: 3 32 2 5
7: 3 32 3 6
8: 3 32 4 7
9: 4 23 1 8
10: 4 23 2 9
11: 5 NA 1 10
12: 5 NA 2 11
13: 5 NA 3 12
For very large data tables, giving the data tables keys will speed execution, e.g.:
setkey(d, id)
setkey(d, state, city)
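A minimal sketch (with a throwaway table d, not the datasets above) of what a key buys you: after setkey, subsetting on the key uses a fast binary search instead of a full vector scan.

```r
library(data.table)
set.seed(2)
d <- data.table(id = sample(1e5), x = runif(1e5))  # id is a permutation of 1:1e5
setkey(d, id)   # sorts by id and marks it as the key
d[.(42)]        # fast keyed lookup; same rows as d[id == 42]
```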
Join/merge can be used for data lookups:
s <- data.table(st=.q(AL, AK, AZ, CA, OK), y=5:1)
stateAbbrevs <- data.table(state=state.abb, State=state.name)
s[stateAbbrevs, , on=.(st=state), nomatch=NULL]
   st y      State
1: AL 5 Alabama
2: AK 4 Alaska
3: AZ 3 Arizona
4: CA 2 California
5: OK 1 Oklahoma
Reshaping Data
To reshape data from long to wide format, take the L data table above as an example.
w <- dcast(L, id ~ day, value.var='y')
w
   id  1  2  3  4
1: 2 1 2 3 NA
2: 3 4 5 6 7
3: 4 8 9 NA NA
4: 5 10 11 12 NA
# Now reverse the procedure
m <- melt(w, id.vars='id', variable.name='day', value.name='y')
m
    id day  y
1: 2 1 1
2: 3 1 4
3: 4 1 8
4: 5 1 10
5: 2 2 2
6: 3 2 5
7: 4 2 9
8: 5 2 11
9: 2 3 3
10: 3 3 6
11: 4 3 NA
12: 5 3 12
13: 2 4 NA
14: 3 4 7
15: 4 4 NA
16: 5 4 NA
setkey(m, id, day) # sorts
m[! is.na(y)]
    id day  y
1: 2 1 1
2: 2 2 2
3: 2 3 3
4: 3 1 4
5: 3 2 5
6: 3 3 6
7: 3 4 7
8: 4 1 8
9: 4 2 9
10: 5 1 10
11: 5 2 11
12: 5 3 12
Other Manipulations of Longitudinal Data
For the longitudinal data table L carry forward to 4 days the subject’s last observation on y if it was assessed earlier than day 4.
w <- copy(L) # fresh start with no propagation of changes back to L
# Only needed if will be using := to compute variables in-place and
# you don't want the new variables also added to L
# This is related to data.table doing things by reference instead of
# making copies. w <- L does not create new memory for w.
# Compute each day's within-subject record number and last record number
# Feed this directly into a data.table operation to save last records
# when the last is on a day < 4
u <- w[, .q(seq, maxseq) := .(1 : .N, .N), by=id][seq == maxseq & day < 4,]
# Extra observations to fill out to day 4
u <- u[, .(day = (day + 1) : 4, y = y), by=id]
u
   id day  y
1: 2 4 3
2: 4 3 9
3: 4 4 9
4: 5 4 12
w <- rbind(L, u, fill=TRUE)
setkey(w, id, day) # sort and index
w
    id day  y
1: 2 1 1
2: 2 2 2
3: 2 3 3
4: 2 4 3
5: 3 1 4
6: 3 2 5
7: 3 3 6
8: 3 4 7
9: 4 1 8
10: 4 2 9
11: 4 3 9
12: 4 4 9
13: 5 1 10
14: 5 2 11
15: 5 3 12
16: 5 4 12
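The copy(L) step above matters because data.table assigns by reference. A minimal sketch (with a throwaway table x) of the difference between an alias and a true copy:

```r
library(data.table)
x <- data.table(v = 1:3)
alias <- x         # no new memory: alias and x are the same table
deep  <- copy(x)   # true copy with its own memory
alias[, v2 := v * 10]   # := modifies in place, so x gains v2 too
names(x)      # "v" "v2"  -- x was changed through alias
names(deep)   # "v"       -- the copy is unaffected
```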
Find the first time at which y >= 3 and at which y >= 7.
day[y >= 3] is read as "the value of day when y >= 3". It is a standard subscripting operation in R for two parallel vectors day and y. Taking the minimum value of day satisfying the condition gives us the first qualifying day.
L[, .(first3 = min(day[y >= 3]),
      first7 = min(day[y >= 7])), by=id]
   id first3 first7
1: 2 3 Inf
2: 3 1 4
3: 4 1 1
4: 5 1 1
Same but instead of resulting in an infinite value if no observations for a subject meet the condition, make the result NA.
mn <- function(x) if(length(x)) min(x) else as.double(NA)
# as.double needed because day is stored as double precision
# (type contents(L) to see this) and data.table requires
# consistent storage types
L[, .(first3 = mn(day[y >= 3]),
      first7 = mn(day[y >= 7])), by=id]
   id first3 first7
1: 2 3 NA
2: 3 1 4
3: 4 1 1
4: 5 1 1
Add a new variable z and compute the first day at which z is above 0.5 for two days in a row for the subject. Note that the logic below looks for consecutive days for which records exist for a subject. To also require the days to be one day apart add the clause day == shift(day) + 1 after shift(z) > 0.5.
set.seed(1)
w <- copy(L)
w[, z := round(runif(.N), 3)]
u <- copy(w)
u
    id day  y     z
1: 2 1 1 0.266
2: 2 2 2 0.372
3: 2 3 3 0.573
4: 3 1 4 0.908
5: 3 2 5 0.202
6: 3 3 6 0.898
7: 3 4 7 0.945
8: 4 1 8 0.661
9: 4 2 9 0.629
10: 5 1 10 0.062
11: 5 2 11 0.206
12: 5 3 12 0.177
mn <- function(x)
if(! length(x) || all(is.na(x))) as.double(NA) else min(x, na.rm=TRUE)
u[, consecutive := z > 0.5 & shift(z) > 0.5, by=id][,
firstday := mn(day[consecutive]), by=id]
u
    id day  y     z consecutive firstday
1: 2 1 1 0.266 FALSE NA
2: 2 2 2 0.372 FALSE NA
3: 2 3 3 0.573 FALSE NA
4: 3 1 4 0.908 NA 4
5: 3 2 5 0.202 FALSE 4
6: 3 3 6 0.898 FALSE 4
7: 3 4 7 0.945 TRUE 4
8: 4 1 8 0.661 NA 2
9: 4 2 9 0.629 TRUE 2
10: 5 1 10 0.062 FALSE NA
11: 5 2 11 0.206 FALSE NA
12: 5 3 12 0.177 FALSE NA
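The tip above about requiring the days to be exactly one day apart can be sketched on a tiny hypothetical table: the shift-based test only flags the second of two qualifying records when there is no gap in day.

```r
library(data.table)
u <- data.table(id = c(1, 1, 1), day = c(1, 3, 4), z = c(0.9, 0.8, 0.7))
# Two records in a row with z > 0.5 AND days one apart
u[, consec := z > 0.5 & shift(z) > 0.5 & day == shift(day) + 1, by=id]
u$consec   # NA FALSE TRUE: day 3 -> 4 qualifies; day 1 -> 3 has a gap
```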
In general, using methods that involve counters makes the logic clearer, easier to debug incrementally, and easier to extend to any number of consecutive times. Create a function that computes the number of consecutive TRUE values or ones such that whenever the sequence is interrupted by FALSE or 0 the counting starts over. As before we compute the first day at which two consecutive z values exceed 0.5.
nconsec is modified from code found here.
nconsec <- function(x) x * sequence(rle(x)$lengths)
# rle in base R: run length encoding
# Example:
x <- c(0,0,1,1,0,1,1,0,1,1,1,1)
nconsec(x)
 [1] 0 0 1 2 0 1 2 0 1 2 3 4
# To require that the time gap between measurements must be <= 2 time
# units use the following example
t <- c(1:9, 11, 14, 15)
rbind(t=t, x=x)
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
t 1 2 3 4 5 6 7 8 9 11 14 15
x 0 0 1 1 0 1 1 0 1 1 1 1
nconsec(x & t <= shift(t) + 2)
 [1] 0 0 1 2 0 1 2 0 1 2 0 1
u <- copy(w)
# nconsec(z > 0.5) = number of consecutive days (counting current
# day) for which the subject had z > 0.5
u[, firstday := mn(day[nconsec(z > 0.5) == 2]), by=id]
# | | | |
# minimum | | |
# day | |
# such that |
# it's the 2nd consecutive day with z > 0.5
u
    id day  y     z firstday
1: 2 1 1 0.266 NA
2: 2 2 2 0.372 NA
3: 2 3 3 0.573 NA
4: 3 1 4 0.908 4
5: 3 2 5 0.202 4
6: 3 3 6 0.898 4
7: 3 4 7 0.945 4
8: 4 1 8 0.661 2
9: 4 2 9 0.629 2
10: 5 1 10 0.062 NA
11: 5 2 11 0.206 NA
12: 5 3 12 0.177 NA
Graphics
I make heavy use of ggplot2, plotly and R base graphics. plotly is used for interactive graphics, and the R plotly package provides an amazing function ggplotly to convert a static ggplot2 graphics object to an interactive plotly one. If the user goes to the trouble of adding labels for graphics entities (usually points, lines, curves, rectangles, and circles) those labels can become hover text in plotly without disturbing anything in static graphics. As shown here you can sense whether an html or pdf report is being produced, and for html all ggplot2 objects can be automatically transformed to plotly.
With ggplotly, extra text appears in front of hover labels, but the result of ggplotly can be run through Hmisc::ggplotlyr to remove it, as shown in the example. Many types of graphs can be created with base graphics, e.g. hist(age, nclass=50) or Ecdf(age), but using ggplot2 for even simple graphics makes it easy to handle multiple groups on one graph or to create multiple panels for strata using faceting. ggplot2 has excellent default font sizes and axis labeling that work for most sizes of plots.
Here is a prototypical ggplot2 example illustrating many of the features I most often use. Ignore the ggplot2 label attribute if not using plotly. Options are given to the Hmisc label function so that it will retrieve the variable label and units (if present) and format them for axis labels or tables.
# Create a vector of formatted labels for all variables in data
vlabs <- sapply(d, label, plot=TRUE, html=TRUE)
# This will format OK for plotly since plotly uses HTML
# It will not look good without plotly so leave off html=TRUE to
# make label use R's plotmath syntax for regular ggplot2
# Define substitutes for xlab and ylab that look up our
# constructed labels.
# Could instead directly use xlab(vlabs['age'])
labx <- function(v) xlab(vlabs[[as.character(substitute(v))]])
laby <- function(v) ylab(vlabs[[as.character(substitute(v))]])
g <-
ggplot(d, aes(x=age, y=bhr, color=gender, label=paste0('dose:', dose))) +
geom_point() + geom_smooth() +
guides(color=guide_legend(title='')) +
theme(legend.position='bottom') + # not respected by ggplotly
labs(caption='Scatterplot of age by basal heart rate stratified by sex') +
labx(age) + laby(bhr)
# or just xlab('Age in years') + ylab('Basal heart rate')
ggplotlyr(g, remove='.*): ')   # removes paste0('dose:', dose): ; dose is in hover text for each point
For large datasets the Hmisc package has a function ggfreqScatter that makes it easy to see overlapping points by color coding the frequency of points in each small bin. That way scatterplots scale to very large datasets. Here is an example:
html=TRUE was needed because otherwise axis labels are formatted using R's plotmath, and plotly doesn't like that.

set.seed(1)
x <- round(rnorm(2000), 1)
y <- 2 * (x > 1.5) + round(rnorm(2000), 1)
z <- sample(c('a', 'b'), 2000, replace=TRUE)
label(x) <- 'X Variable' # could use xlab() &
label(y) <- 'Y Variable' # ylab() in ggfreqScatter()
g <- ggfreqScatter(x, y, by=z, html=TRUE)
# If variables were inside a data table use
# g <- d[, ggfreqScatter(x, y, by=z, html=TRUE)]
gNow convert the graphic to plotly.
ggplotlyr(g)

When you hover the mouse over a point, its frequency pops up.
Many functions in the Hmisc and rms packages produce plotly graphics directly. One of the most distinctive of these is dotchartpl.
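As a taste of dotchartpl, here is a minimal sketch. The argument names (major, minor, group, num, denom) are taken from the Hmisc documentation, and the dataset is entirely made up for illustration:

require(Hmisc)
# Hypothetical proportions by treatment within region x sex strata;
# major/minor define the dot chart rows, num/denom feed the hover text
set.seed(1)
d <- expand.grid(region=c('North', 'South'), sex=c('female', 'male'),
                 tx=c('A', 'B'))
d$denom <- 100
d$num   <- rbinom(nrow(d), 100, 0.3)
with(d, dotchartpl(num / denom, major=region, minor=sex, group=tx,
                   num=num, denom=denom))

Hovering over a dot then shows the numerator and denominator behind each proportion.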
Analysis
Formatting
For analysis the sky is the limit. I take advantage of special formatting for model fit objects from the rms package by using html or latex methods and putting results='asis' in the chunk header so that the formatting is preserved. Here is an example.
require(rms)
options(prType='html') # needed to use special formatting (can use prType='latex')
set.seed(1)
n <- 1000
w <- data.table(x=runif(n), x2=rnorm(n), y=sample(0:1, n, replace=TRUE))
dd <- datadist(w); options(datadist='dd') # rms needs for summaries and plotting
f <- lrm(y ~ x + rcs(x2, 4), data=w)
f

Logistic Regression Model
lrm(formula = y ~ x + rcs(x2, 4), data = w)
| | Model Likelihood Ratio Test | Discrimination Indexes | Rank Discrim. Indexes |
|---|---|---|---|
| Obs 1000 | LR χ² 1.91 | R² 0.003 | C 0.526 |
| 0 492 | d.f. 4 | R²(4,1000) 0.000 | Dxy 0.051 |
| 1 508 | Pr(>χ²) 0.7516 | R²(4,749.8) 0.000 | γ 0.051 |
| max \|∂log L/∂β\| 2×10⁻¹⁴ | | Brier 0.249 | τₐ 0.026 |
| | β | S.E. | Wald Z | Pr(>\|Z\|) |
|---|---|---|---|---|
| Intercept | 0.1079 | 0.2976 | 0.36 | 0.7169 |
| x | 0.1506 | 0.2197 | 0.69 | 0.4931 |
| x2 | 0.0502 | 0.2142 | 0.23 | 0.8147 |
| x2’ | -0.3705 | 0.6631 | -0.56 | 0.5763 |
| x2’’ | 1.3624 | 2.6219 | 0.52 | 0.6033 |
anova(f)
Wald Statistics for y
| | χ² | d.f. | P |
|---|---|---|---|
| x | 0.47 | 1 | 0.4931 |
| x2 | 1.44 | 3 | 0.6961 |
| Nonlinear | 0.33 | 2 | 0.8499 |
| TOTAL | 1.91 | 4 | 0.7523 |
Caching
The workhorse behind Rmarkdown and Quarto is knitr, which processes the code chunks and properly mingles code with tabular and graphical output. knitr has a built-in caching mechanism so that code is not needlessly executed when its inputs have not changed. This easy-to-use process has two disadvantages: the dependencies are not transparent, and the stored cache files may be quite large. I like to take control of caching, and to that end wrote the runifChanged function. Here is an example of its use. First, a function with no arguments must be composed. This is the (usually slow) function that will be conditionally run if any of a group of listed objects has changed since the last time it was run. When it needs to be run, the function produces an object that is stored in binary form in a user-specified file (the default file name is the name of the current R code chunk with .rds appended).
# Read the source code for the hashCheck and runifChanged functions from
# https://github.com/harrelfe/rscripts/blob/master/hashCheck.r
getRs('hashCheck.r', put='source')
g <- function() {
# Fit a logistic regression model and bootstrap it 500 times, saving
# the matrix of bootstrapped coefficients
f <- lrm(y ~ x1 + x2, x=TRUE, y=TRUE, data=dat)
bootcov(f, B=500)
}
set.seed(3)
n <- 2000
dat <- data.table(x1=runif(n), x2=runif(n),
y=sample(0:1, n, replace=TRUE))
# runifChanged will write runifch.rds if needed (chunk name.rds)
# Will run if dat or source code for lrm or bootcov change
b <- runifChanged(g, dat, lrm, bootcov)
dim(b$boot.Coef)
[1] 500   3
head(b$boot.Coef)
      Intercept          x1          x2
[1,] 0.02007292 -0.30079958 0.32416398
[2,] 0.06150624 -0.35741054 0.25522669
[3,] 0.25225861 -0.40094541 0.09290729
[4,] 0.13766665 -0.48661991 0.19684403
[5,] -0.22018456 0.02132711 0.33973578
[6,] 0.18217417 -0.36140896 -0.04873320
Parallel Computing
The runParallel function makes it easy to use available multiprocessor cores to speed up computations, especially simulations. By default it uses one fewer than the number of available cores. runParallel makes the parallel package easier to use and recombines results over per-core batches. The user writes a function that does the work on one core, and the same function is run on all cores. This function has set arguments and must return a named list. A base random number seed is given, and the seed is set to this plus i for core number i. The total number of repetitions is given, and the most balanced possible number of repetitions is run on each core so that the per-core counts sum to the total desired number of iterations. runifChanged is again used, to avoid running the simulations if no inputs have changed.
# Load runParallel function from github
getRs('runParallel.r', put='source')
# Function to do simulations on one core
run1 <- function(reps, showprogress, core) {
cof <- matrix(NA, nrow=reps, ncol=3,
dimnames=list(NULL, .q(a, b1, b2)))
for(i in 1 : reps) {
y <- sample(0:1, n, replace=TRUE)
f <- lrm(y ~ X)
cof[i, ] <- coef(f)
}
list(coef=cof)
}
# Debug one core run, with only 3 iterations
n <- 300
seed <- 3
set.seed(seed)
X <- cbind(x1=runif(n), x2=runif(n)) # condition on covariates
run1(3)$coef
a b1 b2
[1,] -0.5455330 0.9572896 0.2215974
[2,] -0.2459193 0.3081405 0.1284809
[3,] -0.1391344 -0.2668562 0.1792319
nsim <- 5000
g <- function() runParallel(run1, reps=nsim, seed=seed)
Coefs <- runifChanged(g, X, run1, nsim, seed)
dim(Coefs)
[1] 5000    3
apply(Coefs, 2, mean)
           a           b1           b2
0.0020121803 -0.0007277216 -0.0003258610
Simulation
Some of the best ways to validate an analysis are
- If using any model/feature selection methods, use the bootstrap to check whether the selection process is volatile, e.g., whether your sample size is large enough to support making hard-and-fast selections of predictors/features
- Use Monte Carlo simulation to check if the correct model or correct predictors are usually selected
- Simulate a large dataset under a known model and known parameter values and make sure the estimation process you use can recover the true parameter values
- Simulate the statistical performance of a method under a variety of conditions
Unlike static papers in the literature, simulation can study the performance of a method under conditions that mimic your situation.
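As an illustration of the first point above, here is a minimal sketch that tabulates how often each candidate predictor survives selection across bootstrap resamples. It uses base R's step for stepwise AIC selection as a stand-in for whatever selection method is actually being used, and the data are simulated so that only x1 truly predicts y:

set.seed(1)
n <- 100
d <- data.frame(x1=rnorm(n), x2=rnorm(n), x3=rnorm(n))
d$y <- with(d, x1 + rnorm(n))   # only x1 is a real predictor
B <- 200
sel <- matrix(0, nrow=B, ncol=3, dimnames=list(NULL, c('x1', 'x2', 'x3')))
for(i in 1 : B) {
  b <- d[sample(n, replace=TRUE), ]                 # bootstrap resample
  f <- step(lm(y ~ x1 + x2 + x3, data=b), trace=0)  # stepwise AIC selection
  sel[i, intersect(colnames(sel), names(coef(f)))] <- 1
}
apply(sel, 2, mean)   # per-predictor selection proportions

Selection proportions far from 0 and 1 signal that the selection process is volatile and should not be over-interpreted.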
When simulating performance of various quantities under various conditions, creating a large number of variables makes the code long and tedious. It is better to use data frames/tables or arrays to hold everything together. Data frames and arrays also lead to efficient graphics code for summarization.
Data Table Approach
The expand.grid function is useful for generating all combinations of simulation conditions. Suppose we wanted to simulate statistical properties of the maximum absolute value of the sample correlation coefficient from a matrix of all pairwise correlations from truly uncorrelated variables. We do this while varying the sample size n, the number of variables p, and the type of correlation (Pearson’s or Spearman’s, denoted by r and rho). With expand.grid we don’t need a lot of nested for loops. Run 500 simulations for each condition.
nsim <- 500
R <- expand.grid(n=c(10, 20, 50, 100),
p=c(2, 5, 10, 20),
sim=1 : nsim)
setDT(R)
set.seed(1)
for(i in 1 : nrow(R)) { # took 4s
w <- R[i, ]
n <- w$n
p <- w$p
X <- matrix(rnorm(n * p), ncol=p)
cors <- cor(X)
maxr <- max(abs(cors[row(cors) < col(cors)])) # use upper triangle
cors <- cor(X, method='spearman')
maxrho <- max(abs(cors[row(cors) < col(cors)]))
set(R, i, 'maxr', maxr) # set is in data.table & is very fast
set(R, i, 'maxrho', maxrho) # set will create the variable if needed
# If not using data.table use this slower approach:
# R[i, 'maxr'] <- maxr etc.
}

The simulations could have been cached or parallelized as discussed above.
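For example, the whole loop could be wrapped in a no-argument function and cached with runifChanged. This is a sketch assuming hashCheck.r has been sourced as in the Caching section; for brevity only the Pearson correlation is computed:

sim <- function() {
  R <- expand.grid(n=c(10, 20, 50, 100), p=c(2, 5, 10, 20), sim=1 : nsim)
  setDT(R)
  set.seed(1)
  for(i in 1 : nrow(R)) {
    w <- R[i, ]
    X <- matrix(rnorm(w$n * w$p), ncol=w$p)
    cors <- cor(X)
    set(R, i, 'maxr', max(abs(cors[row(cors) < col(cors)])))
  }
  R
}
# Re-runs the simulation only when nsim has changed since the last run
R <- runifChanged(sim, nsim)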
Compute the mean (over simulations) maximum correlation (over variables) and plot the results.
w <- R[, .(maxr = mean(maxr), maxrho=mean(maxrho)), by=.(n, p)]
# Make data table taller and thinner to put r, rho as different observations
u <- melt(w, id.vars=c('n', 'p'), variable.name='type', value.name='r')
u[, type := substring(type, 4)] # remove "max"
ps <- c(2, 5, 10, 20)
u[, p := factor(p, ps, paste0('p:', ps))]
g <- ggplot(u, aes(x=n, y=r, col=type)) + geom_jitter(height=0, width=2) +
ylim(0, 1) +
facet_wrap(~ p) +
guides(color=guide_legend(title='')) +
ylab('Mean Maximum Correlation Coefficient')
plotly::ggplotly(g)

Array Approach
For large problems, storing results in R arrays is more efficient and doesn’t require duplication of values of n and p over simulations. Once the array is created it can be converted into a data table for graphing.
nsim <- 500
ns <- c(10, 20, 50, 100)
ps <- c(2, 5, 10, 20)
R <- array(NA, dim=c(nsim, length(ns), length(ps), 2),
dimnames=list(NULL,
n = as.character(ns),
p = as.character(ps),
type = c('r', 'rho')))
dim(R)
[1] 500   4   4   2
dimnames(R)
[[1]]
NULL

$n
[1] "10"  "20"  "50"  "100"

$p
[1] "2"  "5"  "10" "20"

$type
[1] "r"   "rho"
set.seed(1)
for(i in 1 : nsim) { # took 1s
for(n in ns) {
for(p in ps) {
X <- matrix(rnorm(n * p), ncol=p)
for(type in c('r', 'rho')) {
cors <- cor(X, method=switch(type, r = 'pearson', rho = 'spearman'))
# or method=c(r='pearson', rho='spearman')[type]
# or method=.q(r=pearson, rho=spearman)[type]
# or method=if(type == 'r') 'pearson' else 'spearman'
# or method=ifelse(type == 'r', 'pearson', 'spearman')
maxr <- max(abs(cors[row(cors) < col(cors)]))
R[i, as.character(n), as.character(p), type] <- maxr
}
}
}
}
# Compute mean (over simulations) maximum correlation for each condition
m <- apply(R, 2:4, mean) # preserve dimensions 2,3,4 summarize over 1
# Convert the 3-dimensional array to a tall and thin data table
# A generalization of row() and col() (which work only for 2-dimensional
# matrices) comes in handy: slice.index
dn <- dimnames(m)
u <- data.table(r = as.vector(m),
n = as.numeric(dn[[1]])[as.vector(slice.index(m, 1))],
p = as.numeric(dn[[2]])[as.vector(slice.index(m, 2))],
type = dn[[3]][as.vector(slice.index(m, 3))])
# Plot u using same ggplot code as above

Computing Environment
The following output is created by the command markupSpecs$html$session(), where markupSpecs is defined in the Hmisc package.
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Pop!_OS 21.10

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] rms_6.3-1         SparseM_1.81      data.table_1.13.6 Hmisc_4.7-1
[5] ggplot2_3.3.3     Formula_1.2-4     survival_3.2-13   lattice_0.20-45

To cite R in publications use:
R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
To cite the Hmisc package in publications use:

Harrell Jr F (2022). Hmisc: Harrell Miscellaneous. R package version 4.7-1, https://hbiostat.org/R/Hmisc/.

To cite the rms package in publications use:

Harrell Jr FE (2022). rms: Regression Modeling Strategies. https://hbiostat.org/R/rms/, https://github.com/harrelfe/rms.

To cite the data.table package in publications use:

Dowle M, Srinivasan A (2020). data.table: Extension of 'data.frame'. R package version 1.13.6, https://CRAN.R-project.org/package=data.table.

To cite the ggplot2 package in publications use:

Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.

To cite the survival package in publications use:

Therneau T (2021). A Package for Survival Analysis in R. R package version 3.2-13, https://CRAN.R-project.org/package=survival.